WIERT: Web Information Extraction via Render Tree


Abstract

Web information extraction (WIE) is a fundamental problem in web document understanding, with a significant impact on various applications. Visual information plays a crucial role in WIE tasks, as the nodes containing relevant information are often visually distinct from other nodes, such as being in a larger font size or having a brighter color. However, rendering the visual information of a web page can be computationally expensive. Previous works have mainly focused on the Document Object Model (DOM) tree, which lacks visual information. To efficiently exploit visual information, we propose leveraging the render tree, which combines the DOM tree and the Cascading Style Sheets Object Model (CSSOM) tree and contains not only content and layout information but also rich visual information, at little additional acquisition cost compared to the DOM tree. In this paper, we present WIERT, a method that effectively utilizes the render tree of a web page based on a pretrained language model. We evaluate WIERT on the Klarna product page dataset, a manually labeled dataset of renderable e-commerce web pages, demonstrating its effectiveness and robustness.
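To make the render-tree idea in the abstract concrete, the following is a minimal toy sketch, not the WIERT implementation. It assumes a simplified model in which each DOM node carries its own style declarations, only `font-size` and `color` are inherited, and `display: none` subtrees are dropped, as they are in a browser's render tree. The class names `DomNode` and `RenderNode`, the function names, and the sample page content are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DomNode:
    """Toy DOM node: tag, optional text, declared style, and children."""
    tag: str
    text: str = ""
    style: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


@dataclass
class RenderNode:
    """Toy render-tree node: content plus its resolved (computed) style."""
    tag: str
    text: str
    computed: dict
    children: list = field(default_factory=list)


# Properties that children inherit in this toy model (a small subset of CSS).
INHERITED = ("font-size", "color")


def build_render_tree(node, parent_style=None):
    """Merge a DOM node with style information; skip non-rendered subtrees."""
    parent_style = parent_style or {}
    if node.style.get("display") == "none":
        return None  # display:none subtrees do not appear in a render tree
    # Start from inherited properties, then apply the node's own declarations.
    computed = {k: parent_style[k] for k in INHERITED if k in parent_style}
    computed.update(node.style)
    kids = [build_render_tree(child, computed) for child in node.children]
    kept = [k for k in kids if k is not None]
    return RenderNode(node.tag, node.text, computed, kept)


def flatten(node):
    """Yield (tag, text, computed style) for every rendered node with text."""
    if node.text:
        yield node.tag, node.text, node.computed
    for child in node.children:
        yield from flatten(child)


# Hypothetical product page: a large title, a brightly colored price,
# and a hidden element that never reaches the render tree.
page = DomNode("body", style={"font-size": "14px", "color": "#333"}, children=[
    DomNode("h1", "Acme Kettle", {"font-size": "32px"}),
    DomNode("span", "$24.99", {"color": "#e00"}),
    DomNode("div", "tracking pixel", {"display": "none"}),
])

rendered = list(flatten(build_render_tree(page)))
for tag, text, style in rendered:
    print(tag, text, style)
```

Each surviving node pairs its text with visual attributes such as font size and color, which is the kind of signal the abstract says a DOM-only representation lacks. How WIERT encodes these attributes for the pretrained language model is not described in the abstract and is not modeled here.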


Similar resources

Personalizing Web Publishing via Information Extraction

because Web search and navigation are still underdeveloped. Although Web publishing is increasingly successful, it still requires too much time and effort to precisely locate specific information. This process is often tied to traditional solutions developed outside the Web scenario—for example, information retrieval (IR) models over hypertext rather than simple text documents. Moreover, even d...


Learning n-ary tree-pattern queries for web information extraction

The problem of extracting information from the Web consists in building patterns allowing to extract specific information from documents of a given Web source. Up to now, most existing techniques use string-based representations of documents as well as string-based patterns. Using tree representations naturally allows to overcome limitations of string-based approaches. While some tree-based app...


Web Information Extraction

Information extraction (IE) is the process of automatically extracting structured pieces of information from unstructured or semi-structured text documents. Classical problems in information extraction include named-entity recognition (identifying mentions of persons, places, organizations, etc.) and relationship extraction (identifying mentions of relationships between such named entities). We...


Personalized Web Services for Web Information Extraction

The field of information extraction from the Web emerged with the growth of the Web and the multiplication of online data sources. This paper is an analysis of information extraction methods. It presents a service oriented approach for web information extraction considering both web data management and extraction services. Then we propose an SOA based architecture to enhance flexibility and on-...


Web Information Extraction Systems for Web Semantization

In this paper we present a survey of web information extraction systems and semantic annotation platforms. The survey is concentrated on the problem of employment of these tools in the process of web semantization. We compare the approaches with our own solutions and propose some future directions in the development of the web semantization idea.



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i11.26546